Predicting Bike Rental Demand Using Regression Analysis

Authors

Annmarie Thomson, Christina Zhang, Arya Sardesai

Published

March 15, 2025

Summary

In this project, we build a regression model using best subset selection on bike-sharing data to predict rental demand. We examine factors such as weather, time, and holidays to understand their influence on bike usage.

Introduction

Bike-sharing systems are an integral part of urban transportation (Winters 2020). Understanding the factors that drive bike-share demand can help urban planners optimize services. The dataset we used to fit our regression is the Bike Sharing Dataset (dataset ID: 275) from the UCI Machine Learning Repository (Fanaee-T 2013). It contains information on bike rentals, weather conditions, and time-related features. Our research question is: how many bikes will be rented on a given day, based on weather and temporal factors?

Software & Packages

The R programming language (R Core Team 2024) and the following R packages were used to perform the analysis: knitr (Xie 2024), tidyverse (Wickham et al. 2023), tidymodels (Kuhn and Wickham 2025), ucimlrepo (Dua and Graff 2024), leaps (Lumley 2024), mltools (Sailo 2018), and ggpubr (Kassambara 2023).

Methods and Results

Table 1: Predictor variables used for analysis
Predictor Variable   Description
season               Season in which the bike was rented
holiday              Whether the day of the rental was a holiday
workingday           Whether the day of the rental was a working day
weathersit           Weather situation on the day of the rental
temp                 Temperature on the day of the rental
hum                  Humidity on the day of the rental
windspeed            Wind speed on the day of the rental

Our dataset was loaded and cleaned by ensuring that categorical variables were correctly encoded as factors and by removing irrelevant columns. The data contained no missing values or special characters, so no further cleaning was required.

In our exploratory analysis, we examined how the predictor variables relate to bike rental counts (Figure 1). We also built a correlation matrix (Figure 3) to explore how correlated our variables are. We found multicollinearity between atemp and temp, so moving forward we use only temp in our analysis. Finally, we found that the distribution of bike rental counts is heavily right skewed (Figure 2). Because we plan to use linear regression, which assumes approximately normally distributed errors, we apply a log transformation to the cnt variable moving forward.
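To illustrate why a log transformation helps with a right-skewed response, here is a minimal Python sketch (the analysis itself was done in R). The counts are hypothetical values chosen to be right skewed, not rows from the dataset, and the `skewness` helper is introduced only for this illustration.

```python
import math

def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return sum(((v - mean) / sd) ** 3 for v in values) / n

# Hypothetical right-skewed rental counts (assumption: not taken from the dataset).
counts = [1, 3, 5, 8, 12, 20, 35, 60, 110, 200, 380, 700]

# Natural log, which is what R's log() applies by default.
log_counts = [math.log(c) for c in counts]

raw_skew = skewness(counts)
log_skew = skewness(log_counts)
print(f"skewness of cnt:      {raw_skew:.2f}")
print(f"skewness of log(cnt): {log_skew:.2f}")
```

A skewness near zero after the transformation indicates a roughly symmetric distribution; it does not by itself guarantee that the regression residuals are normal.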

Figure 1: Distributions of predictor variables vs bike rental counts
Figure 2: Distribution of bike rental counts
Figure 3: Correlation matrix of predictor variables

The data was split into training (75%) and testing (25%) sets, stratified by cnt (total bike counts). To check that the two sets were comparable, we computed the median, mean, and standard deviation of cnt for each set; the values are similar, as shown in Table 2.

Table 2: Summary statistics for response variable (cnt) for each data split
partition fraction median_cnt mean_cnt sd_cnt
Train 0.75 143 189.5325 180.9092
Test 0.25 139 189.2548 182.8360
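The stratified split described above can be sketched in Python as follows. This is an illustrative approximation, not the code used in the report: the function name `stratified_split`, the choice of four quantile bins, and the simulated counts are all assumptions made for the example.

```python
import random
import statistics

def stratified_split(values, train_frac=0.75, n_strata=4, seed=1):
    """Split indices into train/test, sampling train_frac from each quantile bin."""
    rng = random.Random(seed)
    order = sorted(range(len(values)), key=lambda i: values[i])
    train, test = [], []
    for k in range(n_strata):
        # Consecutive slices of the sorted order act as quantile-based strata.
        lo = k * len(order) // n_strata
        hi = (k + 1) * len(order) // n_strata
        stratum = order[lo:hi]
        rng.shuffle(stratum)
        cut = round(train_frac * len(stratum))
        train.extend(stratum[:cut])
        test.extend(stratum[cut:])
    return train, test

# Hypothetical right-skewed counts standing in for cnt (assumption).
data_rng = random.Random(0)
cnt = [int(data_rng.expovariate(1 / 190)) + 1 for _ in range(2000)]

train_idx, test_idx = stratified_split(cnt)

# Summary statistics per partition, in the spirit of Table 2.
for name, idx in (("Train", train_idx), ("Test", test_idx)):
    part = [cnt[i] for i in idx]
    print(name, round(len(part) / len(cnt), 2), statistics.median(part),
          round(statistics.mean(part), 1), round(statistics.stdev(part), 1))
```

Sampling within bins of the response keeps the low, middle, and high ranges of cnt represented in both partitions, which is why the summary statistics of the two sets tend to agree.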

To determine the most appropriate model, we used best subset selection (Table 3). Because weathersit is a categorical variable, we also compared models with and without it to determine our final model.

Table 3: Number of predictors in the model with the largest R^2 and adjusted R^2
R2 Adj.R2
7 7
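As a rough illustration of the idea behind best subset selection, the Python sketch below fits an ordinary least squares model on every non-empty subset of predictors and keeps the one with the highest adjusted R^2. It is not the leaps implementation used in the report: the helper names, the simulated data, and the coefficients used to generate it are assumptions for the example only.

```python
import random
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def adjusted_r2(rows, y, cols):
    """Fit OLS with an intercept on the chosen columns and return adjusted R^2."""
    X = [[1.0] + [row[c] for c in cols] for row in rows]
    p = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(X, y)) for i in range(p)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in X]
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    n, k = len(y), len(cols)
    return 1 - (ss_res / ss_tot) * (n - 1) / (n - k - 1)

def best_subset(rows, y, names):
    """Try every non-empty subset of predictors; keep the best adjusted R^2."""
    best = (float("-inf"), ())
    for size in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), size):
            best = max(best, (adjusted_r2(rows, y, cols), cols))
    return best[0], [names[c] for c in best[1]]

# Simulated data (assumption): the response depends on temp and hum only.
rng = random.Random(42)
names = ["temp", "hum", "windspeed"]
rows = [[rng.random() for _ in names] for _ in range(200)]
y = [4.0 + 2.5 * t - 2.5 * h + rng.gauss(0, 0.3) for t, h, _ in rows]

score, chosen = best_subset(rows, y, names)
print(chosen, round(score, 3))
```

With p candidate predictors the search covers 2^p - 1 models, which is manageable for the seven predictors in Table 1 but grows quickly for larger sets.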

We created two linear regression models, with and without weather respectively, to assess the impact of weather on bike demand. Because the model that included weather had a slightly higher adjusted R^2 (Table 4), we decided to use that model as our final regression model, as seen in Table 5.
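For reference, adjusted R^2 is conventionally defined as below, where n is the number of observations used to fit the model and p is the number of predictors (both symbols are introduced here for illustration and do not appear in the original analysis output).

```latex
R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

Unlike R^2, this quantity can decrease when an added predictor explains too little additional variance, which makes it a reasonable basis for comparing models with different numbers of predictors.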

Table 4: Comparing model with and without weather
Adj.R2_with Adj.R2_without
0.2613957 0.2594724

Table 5: Final model summary
term estimate std.error statistic p.value
(Intercept) 4.2765471 0.0627358 68.167583 0.0000000
season 0.1630897 0.0109232 14.930624 0.0000000
holiday -0.2016433 0.0705456 -2.858342 0.0042654
workingday -0.0707891 0.0250026 -2.831269 0.0046435
weathersit 0.1161699 0.0196582 5.909492 0.0000000
temp 2.5676392 0.0620024 41.411931 0.0000000
hum -2.6133252 0.0687452 -38.014652 0.0000000
windspeed 0.5399395 0.0978825 5.516199 0.0000000
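Because the response is log-transformed, the coefficients in Table 5 act multiplicatively on the predicted count. The short Python sketch below converts two of the estimates into approximate percent changes, assuming the transformation was the natural log (the default for R's log()); the `percent_change` helper is introduced only for this illustration.

```python
import math

# Estimates copied from Table 5; the response is assumed to be natural-log cnt.
coefficients = {"holiday": -0.2016433, "workingday": -0.0707891}

def percent_change(beta):
    """Percent change in predicted cnt for a one-unit increase, other terms held fixed."""
    return (math.exp(beta) - 1) * 100

for term, beta in coefficients.items():
    print(f"{term}: {percent_change(beta):+.1f}%")
```

Under that assumption, a holiday is associated with roughly 18% fewer predicted rentals and a working day with roughly 7% fewer, holding the other predictors constant.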

To assess the model fit, we generated a residual plot (Figure 4). This plot indicates that, even with our log transformation, the residuals are somewhat heteroscedastic. In future iterations of this project we plan to adopt a different, more appropriate model.

Figure 4: Residual plot of final model

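Finally, to evaluate prediction accuracy, we calculated the RMSE on the test set. We found it to be approximately 1.29, which we take as a sign that the model is useful. Note, however, that Table 6 reports 259.6163, so the two values appear to be on different scales (for example, log versus raw rental counts) and should be compared with care.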

Table 6: RMSE of our linear model
RMSE
259.6163
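When the response is modeled on the log scale, the RMSE depends heavily on the scale used to compute it. The Python sketch below shows this with hypothetical counts and predictions (none of these numbers come from the report); the `rmse` helper is defined here for illustration.

```python
import math

def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

# Hypothetical test-set counts and log-scale predictions (assumption).
cnt = [16, 40, 120, 250, 500]
log_pred = [3.0, 3.5, 4.5, 5.5, 6.0]

# Both sides on the log scale: the units are log(cnt).
log_scale = rmse([math.log(c) for c in cnt], log_pred)

# Predictions back-transformed with exp(): the units are rentals.
count_scale = rmse(cnt, [math.exp(p) for p in log_pred])

# Raw counts against log-scale predictions: the scales are mixed,
# so this number is not a meaningful error measure.
mismatched = rmse(cnt, log_pred)

print(f"log scale:   {log_scale:.3f}")
print(f"count scale: {count_scale:.1f}")
print(f"mismatched:  {mismatched:.1f}")
```

Reporting the scale alongside the RMSE, and back-transforming predictions before comparing them with raw counts, makes the value easier to interpret against the summary statistics in Table 2.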

Discussion

We found that the ideal model for our data includes season, holiday status, whether it is a working day, the weather situation, the temperature, the humidity, and the wind speed. Our model became slightly stronger with the inclusion of the weather situation variable (adjusted R^2 of 0.2614 versus 0.2595 in Table 4). Though none of these findings are individually surprising, we were surprised that all of the variables had an impact on bike demand prediction, and we wonder whether further research could identify other variables to include in this model. These findings suggest that these variables significantly influence bike demand, information that can be used to help increase total users.

Future Questions:

Could a non-linear model be more accurate in terms of prediction?

How do long-term weather trends affect seasonal bike usage?

What other outside variables are impactful in the prediction of bike-share usage?

References

Dua, Dheeru, and Casey Graff. 2024. UCI Machine Learning Repository. https://cran.r-project.org/package=ucimlrepo.
Fanaee-T, Hadi. 2013. “Bike Sharing.” UCI Machine Learning Repository.
Kassambara, Alboukadel. 2023. Ggpubr: ’Ggplot2’ Based Publication Ready Plots. https://cran.rstudio.com/package=ggpubr.
Kuhn, Max, and Hadley Wickham. 2025. Tidy Modeling with R. https://CRAN.R-project.org/package=tidymodels.
Lumley, Thomas. 2024. Leaps: Regression Subset Selection. https://cran.rstudio.com/package=leaps.
R Core Team. 2024. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. https://www.r-project.org/.
Sailo, Daniel. 2018. Mltools: Machine Learning Tools. https://cran.rstudio.com/package=mltools.
Wickham, Hadley et al. 2023. The Tidyverse: Easily Install and Load the Tidyverse. https://CRAN.R-project.org/package=tidyverse.
Xie, Yihui. 2024. Knitr: A General-Purpose Package for Dynamic Report Generation in R (version 1.49). https://cran.r-project.org/package=knitr.